In this study the magnitudes of local dependence generated by cloze test 
items and reading comprehension items were compared and their impact on 
parameter estimates and test precision was investigated. An advanced 
English as a foreign language reading comprehension test containing three 
reading passages and a cloze test was analyzed with a two-parameter logistic 
testlet response model and a two-parameter logistic item response model. 
Results showed that the cloze test produced substantially higher magnitudes 
of local dependence than reading items, although the levels of local dependence 
produced by reading items were not negligible. Further analyses demonstrated 
that while even substantial testlet effects do not impact 
parameter estimates, they do influence test reliability and information. 
Implications of the research for foreign language proficiency testing, where 
testlets are regularly used, are discussed. 


Testlets are sets of items grouped together around the same stimuli 
such as shared reading passages, scenarios, figures, or tables. Testlets have 
been lauded in educational testing on the following grounds: (a) they save 
testing time as it is more efficient both for test developers and test takers to 
have a number of items following a common stimulus than to have just a 
single item, (b) unlike atomistic decontextualized items such as 
multiple-choice items, testlets, which are a combination of linked items, may 
increase authenticity of test tasks by providing more context, and (c) testlets 
provide solutions to some of the problems associated with adaptive tests. In 
adaptive tests, where examinees take dissimilar sets of items, context effects 
due to item location, cross information, or unbalanced content may 
introduce construct-irrelevant variance into the assessment. Testlets can 
diminish these contextual effects by forming fixed item-content units 
(Wainer, Bradlow, & Wang, 2007). 

Despite their appealing features, testlets may introduce additional 
sources of construct-irrelevant variance. Items grouped under the same 
testlet might correlate with each other over and above the influence of the 
latent trait. The interrelatedness is likely to lead to a problem known as 
local item dependence (LID) in educational testing. A critical assumption in 
all standard statistical models in general and educational testing in 
particular is independence of observations. Items are said to be locally 
dependent if a person’s response to an item depends on his or her response to 
another item. The local independence assumption holds when persons’ 
responses to test items are affected solely by the trait intended to be 
measured by the items. When the contribution of the intended latent trait is 
removed, correlations between the items should be zero, unless there is a 
secondary dimension affecting responses. This subsidiary dimension might 
arise due to person-related characteristics such as differences in motivation 
or attention, differences in background knowledge, and ambiguities in the 
information provided in the input (Yen, 1993; Yen & Fitzpatrick, 2006). 
Using standard item response theory (IRT) models when LID is present 
may lead to problems such as biased item difficulty and discrimination 
parameters, overestimation of the precision of person ability estimates, 
overestimation of test reliability and test information, and underestimation 
of the standard errors of parameter estimates (Wainer et al., 2007; Yen & 
Fitzpatrick, 2006). Ignoring LID might lead to severe problems in judging 
psychometric qualities of tests which might in turn result in serious 
consequences regarding test score interpretation and use. In computer 
adaptive tests, for example, overestimation of the precision of person 
parameters might lead to premature termination of the test, where precision 
of ability estimate is the termination criterion (Wainer & Wang, 2000). Also 
in classical test theory LID can lead to overestimation of reliability which is 
the result of high intercorrelations among items in the same testlet over and 
above the construct of interest. Zhang (2010) showed that ignoring testlet 
effect can also diminish classification accuracy of examinees. 

LID has been addressed in one of the following ways in the literature: 
(1) Score-based polytomous item response theory models such as the 
graded response model (Samejima, 1969), polytomous logistic regression 
(Zumbo, 1999), partial credit model (Masters, 1982), and rating scale model 
(Andrich, 1978) have been fitted to testlet data (Baghaei, 2010). In these 
models each testlet with m questions is treated as a super item with the total 
score ranging from 0 to m; (2) item-based testlet response theory (TRT) 
models such as the 2-PL TRT (Bradlow, Wainer, & Wang, 1999), 3-PL TRT 
(Wainer, Bradlow, & Du, 2000), and the Rasch testlet model or 1-PL TRT 
(Wang & Wilson, 2005) have been employed; and (3) item-based multilevel 
testlet response theory models such as the three-level testlet response theory 
model (Jiao, Wang, & Kamata, 2005) or the two-level cross-classified 
testlet response theory model (Beretvas & Walker, 2012; Ravand, 2015) 
have been employed. 

Score-based approaches to LID are limited in that: (1) They would 
lead to loss of information since they do not take into account the exact 
response patterns of test takers to individual items within a testlet, that is, 
the difference in the response patterns of the examinees with the same sum 
scores is not known (Wainer, et al., 2007) and (2) the model only works 
when LID magnitude is moderate and there are many independent items 
(Wainer, 1995). 

To account for LID without loss of information, Bradlow, Wainer, 
and Wang (1999) advanced a TRT model which is an extension of the 2-PL 
IRT model (Birnbaum, 1968). The 2-PL TRT is a multidimensional IRT 
model (Reckase, 2009; see also Baghaei, 2012) which includes a random 
effect parameter, $\gamma$, to account for the interdependencies of the items 
within the same testlet. According to this model, the probability of a correct 
answer to an item $i$ nested in testlet $d(i)$ for a person $n$ with ability $\theta_n$ is 
expressed as: 

$$\frac{\exp\left[a_i\left(\theta_n - b_i - \gamma_{nd(i)}\right)\right]}{1 + \exp\left[a_i\left(\theta_n - b_i - \gamma_{nd(i)}\right)\right]} \tag{1}$$

where $a_i$ and $b_i$ are the item discrimination and difficulty parameters, 
respectively, and $\gamma_{nd(i)}$ is the testlet effect parameter for person $n$ on testlet 
$d(i)$. 
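
Equation 1 can be sketched as a small function. This is only an illustration under stated assumptions: the function name and the numerical values are hypothetical, `gamma` stands for the person-specific testlet effect, and the logistic form is written as 1 / (1 + exp(-x)), which is algebraically the same as exp(x) / (1 + exp(x)).

```python
import math


def trt_2pl_probability(theta, a, b, gamma=0.0):
    """Probability of a correct response under the 2-PL testlet model.

    theta: general ability of the person
    a, b:  item discrimination and difficulty
    gamma: person-specific effect for the testlet the item belongs to
    """
    logit = a * (theta - b - gamma)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical values, chosen only for illustration.
p_without_effect = trt_2pl_probability(theta=0.5, a=1.2, b=0.0)
p_with_effect = trt_2pl_probability(theta=0.5, a=1.2, b=0.0, gamma=0.5)
print(p_without_effect, p_with_effect)
```

With `gamma` left at zero the function is the ordinary 2-PL item response function, so the same code covers both models compared in this study.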

The distinctive component of the model, as compared to the standard 
2-PL model, is the introduction of the random effects parameter $\gamma$, which 
represents the local dependency within each testlet $d(i)$. TRT yields two 
person ability parameters: a general ability $\theta_n$ and a testlet-specific ability 
$\gamma_{nd(i)}$. The testlet-specific parameter is caused by person characteristics such 
as background knowledge, passage dependence, motivation, etc. $\gamma_{nd(i)}$ is 
common (i.e., fixed) across items and random (i.e., varying) across persons 
(Wainer et al., 2007; Wang & Wilson, 2005). Introduction of the random 
effects parameter makes TRT a special case of multidimensional IRT 
models, i.e., a bifactor model, where each item simultaneously loads on two 
factors (dimensions): an overall ability dimension and a testlet-specific 
dimension. 

When $\gamma = 0$, i.e., when there is no testlet effect, the assumption of 
local independence holds and the model reduces to a standard 2-PL model. 
The higher the variance of $\gamma$, the greater the LID. It is worth noting that 
Equation 1 reduces to the Rasch testlet model if the discrimination 
parameter $a_i$ is the same for all the items. 

TRT models have been applied extensively to model LID. Wainer and 
Wang (2000) applied the 3-PL TRT model to analyze LID in the reading 
and listening comprehension sections of the Test of English as a Foreign 
Language (TOEFL). They found that LID due to testlet effect did not affect 
difficulty estimates but resulted in biased discrimination and guessing 
parameter estimates. They also compared the 3-PL TRT results with those 
obtained from a standard IRT model. They found that when LID was 
ignored, test information was overestimated by 15% for some 
ability levels. In another application of the 3-PL TRT, Chang and Wang 
(2010) explored testlet effect in the Programme for International Student 
Assessment. In line with Wainer and Wang’s (2000) study they found 
negligible effect of LID on item difficulty estimates. However, they found 
that item discrimination and the precision of examinee ability measures were 
overestimated. Zhang (2010) studied the effect of ignoring LID on 
examinee classification accuracy. He found that standard errors of ability 
estimates obtained under the 3-PL TRT were sizably higher than those 
based on the standard IRT model, so that the standard IRT model erroneously 
implied higher measurement precision in the estimates. Along the same lines, 
in the context of mastery classification for criterion-referenced tests, 
Baghaei (2007) demonstrated that ignoring LID can lead to erroneous 
decisions, especially near the cut-score. 

In two more recent studies Eckes (2014) and Eckes and Baghaei 
(2015) employed the 2-PL TRT model to explore testlet effects. Eckes 
(2014) studied testlet effect in the listening section of the Test of German as a 
Foreign Language (TestDaF). He found that ignoring LID led to 
overestimation of reliability and underestimation of standard errors of 
ability estimates. He also found that item discrimination and difficulty 
estimates were not severely affected. Eckes and Baghaei (2015) examined 
local dependence in the items of a C-test. They found that testlet effects 
were very small; hence, parameters obtained under the 2-PL TRT and those 
obtained from the standard IRT models were highly comparable. They also 
found that when LID was ignored the C-test reliability was overestimated. 

Cloze test and reading comprehension 

A classical cloze test consists of a single longer passage in which 
every nth word is deleted and the test takers have to supply the missing 
words (Oller, 1979). Researchers are divided on the issue of what cloze 
tests measure. The long-running argument has largely concerned whether 
cloze tests measure sentence-level grammatical knowledge or global text 
comprehension. Some studies have argued that cloze tests are appropriate 
for measuring reading comprehension ability (Chihara, Oller, Weaver, & 
Chavez-Oller, 1977; Bachman, 1985; Jonz, 1990; McKenna & Layton, 
1990; Chavez-Oller, Chihara, Weaver, & Oller, 1985). They have shown 
that cloze tests are sensitive to text-level constraints. On the other hand, 
some other studies have concluded that cloze tests are measures of local 
syntactic constraints (e.g., Alderson, 1979; 1980; Kibby, 1980; Shanahan et 
al., 1982; Markman, 1985). However, there is “some consensus among 
researchers that not all deletions in a given cloze passage measure exactly 
the same abilities” (Bachman, 1985, p.535). Bachman (1985) concludes that 
a possible source of inconsistency among the results of studies on construct 
validity of cloze tests is that these studies have not distinguished between 
the cloze types. The majority of the studies have created the cloze tests 
through the fixed-ratio deletion procedure (every nth word deleted). 
Bachman prepared two versions of cloze from the same passage: fixed-ratio 
and rational cloze. He classified the knowledge type required to restore the 
missing words in the rational cloze as: (1) within clause, (2) within 
sentence, (3) within text, and (4) extra-textual. He concluded that the 
majority of the words deleted in the every-nth-words-deleted procedure 
(i.e., fixed-ratio) were of the clause-level (Type 1) or the extra-textual 
(Type 4) and there were few gaps requiring within-sentence and within-text 
context. The results of his study also indicated that through a rational 
deletion method, test developers can include deletions of the Types 2 and 3 
which can be restored using textual understanding. In line with the finding 
of Bachman, Alderson (2000) reserved the term cloze for the fixed-ratio 
procedure and “gap-filling” for the conventionally called rational deletion 
cloze tests. He argues that gap-filling tests can be used as reading 
comprehension tests. Yamashita (2003) investigated construct validity of a 
gap-filling test using verbal protocols of the test-takers’ response processes 
categorized according to the classification framework developed by 
Bachman (1985). The results of the study showed that readers at different 
ability levels used text-level processing more frequently. Yamashita concluded that 
gap-filling tests can be used to measure higher-level reading comprehension 
ability. 

Along the same lines, Greene (2001) found that rational deletion 
cloze tests and true-false reading comprehension tests produce the same 
mean and dispersion among college-level students. Greene (2001) argued 
that a cloze test whose items have been carefully selected to tap into test 
takers’ inference making ability can “measure a reader's macroprocessing of 
theoretical text” (p. 92). 

Bormuth (1969) studied the factorial validity of cloze test by 
administering nine cloze tests and seven multiple-choice reading 
comprehension passages to a sample of grade four, five, and six students. 
Exploratory factor analysis yielded one factor with an eigenvalue greater 
than one which accounted for 77% of the variance. All cloze tests and 
reading comprehension passages loaded on this factor which was named 
reading comprehension ability factor. Further correlational and factor 
analytic evidence for the validity of cloze as a measure of reading 
comprehension has been provided by Rankin (1959), Bormuth (1967, 
1968a, 1968b), Rye (1982), Harrison (1980), and Klare (1984). 

Purpose of the study 

One key issue in cloze tests which poses problems for the analysis and 
scaling of items and persons is the interdependency of cloze test items. 
Dependency among cloze items is a violation of the local item 
independence assumption addressed above. This restricts analysis of cloze 
test items with IRT models and the estimation of its reliability with internal 
consistency methods such as Cronbach’s alpha (Bachman, 1990; Farhady, 
1983). 

The same problem is encountered for the analysis of reading 
comprehension items as such items are usually clustered around a passage 
which makes them a testlet and hence may lead to the violation of local 
independence. In language testing literature it is commonly stated that due 
to the interconnected structure of cloze test items internal consistency 
reliability estimates such as Cronbach’s alpha are not appropriate for 
estimating reliability and test-retest and parallel-forms methods are 
suggested instead (Bachman, 1990; Farhady, 1983). This admonition is also 
given for the C-Test, which is a special form of cloze test. In a C-Test, instead of 
deleting entire words, the second half of every other word is deleted and 
examinees have to restore the missing letters (Grotjahn, 1987; Klein-Braley, 
1997). As a C-Test battery is always composed of four to six short texts, 
correct answers on each text are aggregated and each passage is entered into 
the analysis as a polytomous item (Baghaei, 2011; Raatz & Klein-Braley, 
2002; Sigott, 2004). Obviously this is done to circumvent the local 
dependence which seems to be present in the C-Test. 

In this study we aim to address the issue of LID in cloze tests and 
reading comprehension tests by comparing the magnitude of LID each 
generates and its effect on parameter estimates and test precision. To 
address these issues the following research questions were formulated: 

1. To what extent do reading comprehension test items and the cloze test 
items introduce local item dependency (LID)? Which test type, whether 
cloze or reading, is more affected by local dependency? In other words, 
which kind of test is a more intense source of LID? The general impression 
is that due to the interconnected structure of items in cloze tests they induce 
more dependency (Bachman, 1990; Farhady, 1983) than reading 
comprehension items which are at least structurally independent of each 
other. It is interesting to put this assumption into an empirical test and 
compare the two test types in terms of generating LID. 

2. How do person and item parameters obtained under the standard 
2-PL IRT model, where LID is ignored, compare with those obtained under 
the 2-PL TRT model, where LID is accounted for? The answer to this question 
helps gauge the practical consequences of local dependence for measurement in 
contexts where testlets are regularly used. 


METHOD 

Instrument and Participants. Participants of the study were two 
random subsamples of the Iranian National University Entrance 
Examination (INUEE) candidates ($N_1 = 5412$, 71.3% females and 26.8% 
males, and $N_2 = 5374$, 69.3% females and 28.9% males) who applied for the 
English master’s programs in state universities in 2012. The INUEE, as the sole 
selection criterion, is a high-stakes test that screens the applicants into 
English Studies programs at the M.A. level in state-run universities in Iran. The 
INUEE is administered once a year in February. The participants were 
Iranian nationals and the majority of them held a B.A. degree in English 
Studies (88%). 

The INUEE measures general English proficiency at an advanced 
level and content knowledge of the applicants in teaching methodology, 
principles of language testing, and linguistics. The general English section 
consists of 10 grammar items, 20 vocabulary items, 20 reading 
comprehension items, and 10 cloze test items. Cloze is deemed to be a test 
of general language proficiency in a foreign language (Oller, 1972, 1979) 
and a test of reading comprehension by other researchers (Bormuth, 1969; 
Greene, 2001; Rye, 1982). Inspection of the cloze test analysed in the 
present study clearly revealed that the deletion method is not fixed-ratio 
deletion, as the distances among the gaps were not equal. Most of the 
deletions were cohesive devices and key content words which required 
text-level understanding. Therefore, it can be argued that the cloze test is a 
rational-deletion cloze and, consequently, a test of reading comprehension. 
Nevertheless, there is no information in the test documentation on the 
rationale behind the deletions. 

All the questions were multiple-choice and test takers had to complete 
the items in 60 minutes. For the purpose of the present study only the 
reading comprehension section and the cloze test were used. The reading 
comprehension section was composed of three passages on academic topics 
with seven, six and seven items following each passage, respectively. The 
cloze test was 4-option multiple choice with 10 items (gaps). 


Data analysis. Two IRT models were separately fitted to the selected 
samples of the study: (1) a standard 2-PL IRT model (Birnbaum, 1968), 
where it is assumed that no LID exists, and (2) a 2-PL TRT model (Bradlow, 
Wainer, & Wang, 1999), where LID is systematically modeled and 
conditioned out (see Footnote 1). The SCORIGHT computer programme (Version 3.0; Wang, 
Bradlow, & Wainer, 2005) was used for the analyses. SCORIGHT estimates 
the model parameters with Bayesian estimation techniques. Bayesian methods 
incorporate prior information about model 
parameters to facilitate estimation (Fox, 2010; Gelman, Carlin, Stern, & 
Rubin, 2003). In SCORIGHT, inferences for unknown parameters are 
obtained by drawing samples from their posterior distributions using 
Markov chain Monte Carlo (MCMC) techniques (Wang et al., 2005). Five 
chains were run. For each chain, after 
a burn-in period of 4000 iterations, which Wang et al. (2005) advise is 
sufficient, the next 1000 iterations were used for inferences, where every 
tenth draw was retained to reduce the high autocorrelation. The MCMC 
sampler converged properly as the potential scale reduction factors for the 
prior and hyperprior parameters were all very close to 1.0. 
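
The thinning and convergence check described above can be sketched as follows. This is a minimal illustration that assumes the classic Gelman-Rubin form of the potential scale reduction factor; it is not SCORIGHT's own code, the function names are hypothetical, and the chains are simulated stand-ins rather than draws from the study.

```python
import math
import random
from statistics import mean, variance


def thin(chain, step):
    """Keep every `step`-th draw of a chain to reduce autocorrelation."""
    return chain[step - 1::step]


def potential_scale_reduction(chains):
    """Gelman-Rubin statistic for equally long chains of one parameter."""
    n = len(chains[0])
    within = mean([variance(c) for c in chains])          # W
    between_over_n = variance([mean(c) for c in chains])  # B / n
    pooled = (n - 1) / n * within + between_over_n
    return math.sqrt(pooled / within)


random.seed(1)
# Simulated stand-ins for post-burn-in draws: 5 chains x 1000 iterations.
chains = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(5)]
thinned = [thin(c, 10) for c in chains]  # 100 retained draws per chain
r_hat_mixed = potential_scale_reduction(thinned)

# Two chains stuck in different regions give a value far above 1.
r_hat_stuck = potential_scale_reduction([[0, 1, 2, 3], [10, 11, 12, 13]])
print(round(r_hat_mixed, 3), round(r_hat_stuck, 3))
```

Values close to 1.0 indicate that between-chain variation is small relative to within-chain variation, which is the sense in which convergence is judged here.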


1 Since the test is multiple-choice we first fitted the 3-PL IRT and 3-PL TRT models to 
account for guessing too. However, the models did not converge even though we ran 10 chains with 
35000 iterations in each chain. 





RESULTS 

Testlet effects 

Table 1 presents the magnitudes of the testlet effects ($\gamma$) and their 
associated standard errors for each testlet in the two samples. As the table 
shows, the cloze test generates the highest level of local dependency. This 
finding agrees with our expectations considering the structure of cloze test 
items. 


Table 1. Testlet statistics in samples 1 and 2. 

Testlet       No. Items   Estimate   S.E.
Sample 1
  Cloze       10          1.646      0.143
  Passage 1   7           0.453      0.058
  Passage 2   6           0.136      0.024
  Passage 3   7           0.441      0.039
Sample 2
  Cloze       10          1.831      0.163
  Passage 1   7           0.495      0.055
  Passage 2   6           0.179      0.027
  Passage 3   7           0.445      0.040

Note: $N_1 = 5412$, $N_2 = 5374$; S.E.: standard error 

There are no accepted rule-of-thumb values for judging testlet effect 
parameters (Eckes & Baghaei, 2015). Simulation studies show that testlet 
effects smaller than .25 are negligible (Glas, Wainer, & Bradlow, 2000; 
Wang, Bradlow, & Wainer, 2002; Wang & Wilson, 2005; Zhang, 2010). 
Note that TRT is a bifactor model where items nested within a testlet are 
modeled to load on a specific testlet dimension (a group factor) while 
simultaneously loading on a general ability dimension. The reported testlet 
effects are the variances of $\gamma$, i.e., the variances of these specific factors. To judge 
the magnitude of the local dependence generated by a testlet, this variance is compared 
with the variance of the general ability dimension. The higher the variance 
of testlet-specific dimensions compared to the variance of the general ability 
dimension, the more local dependence the testlet has generated (Baghaei & 
Aryadoust, 2015). In SCORIGHT, for model identification, the variance of 
the ability distribution is set to one with a mean of zero. Therefore, testlet 
effects are compared with one. In Sample 1 the cloze test has a testlet effect 
equal to 1.646, which is substantially higher than the variance of the general 
ability dimension. Reading passages 1 and 3 have testlet effect estimates 
of almost half the variance of the general ability dimension, which are not 
negligible. Reading passage 2 exhibits a benign magnitude of local 
dependency. Testlet effects are slightly higher in Sample 2 but the same 
pattern is observed. 
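
The comparison described above can be made explicit with a short sketch that uses the Sample 1 estimates from Table 1. The .25 cut-off is the value from the cited simulation studies and the general ability variance of 1.0 is the identification constraint mentioned above; the function name and the three verbal labels are hypothetical and serve only as an illustration.

```python
GENERAL_ABILITY_VARIANCE = 1.0  # fixed to one for model identification
NEGLIGIBLE_BELOW = 0.25         # cut-off suggested by the cited simulation studies

# Testlet effect estimates for Sample 1, copied from Table 1.
sample_1 = {
    "Cloze": 1.646,
    "Passage 1": 0.453,
    "Passage 2": 0.136,
    "Passage 3": 0.441,
}


def describe_testlet_effect(testlet_variance):
    """Return the ratio to the general ability variance and a rough label."""
    ratio = testlet_variance / GENERAL_ABILITY_VARIANCE
    if testlet_variance < NEGLIGIBLE_BELOW:
        label = "negligible"
    elif ratio <= 1.0:
        label = "non-negligible"
    else:
        label = "exceeds general ability variance"
    return ratio, label


for name, estimate in sample_1.items():
    ratio, label = describe_testlet_effect(estimate)
    print(f"{name}: ratio = {ratio:.3f} ({label})")
```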

Item parameters 

In standard IRT it is assumed that no local dependency among items 
exists, while in TRT local dependency is factored out systematically by 
adding a random effect parameter $\gamma$ to the IRT item response function. In 
this section item parameters across the two models, i.e., TRT, which 
accounts for LID, and IRT, where LID is ignored, are compared. 

Table 2 shows the descriptive statistics for item parameters in the two 
models in sample 1. The root mean-square measurement error (RMSE) is 
the square root of the average of the parameter error variances which is an 
index of precision of estimation (Linacre, 2012). 

In Sample 1 discrimination parameters across the two models 
correlated at .989. The difference between a parameters estimated by each 
model for each item was computed. The absolute values of these differences 
ranged from 0.00 to .210 with a mean of .052 and a standard deviation of 
.049. The mean of squared absolute differences turned out to be .005 with a 
standard deviation of .008. The root mean square deviation was .070. 

Difficulty parameters across the two models correlated at .997. The 
difference between b parameters estimated by each model for each item was 
computed too. The absolute values of these differences ranged from 0.00 to 
.640 with a mean of .152 and a standard deviation of .164. The mean of 
squared differences turned out to be .049 with a standard deviation of .095. 
The root mean square deviation was .221. The RMSEs show that while 
the discrimination parameter is estimated with the same precision across the 
models, the difficulty parameter is estimated with a higher precision under 
TRT. 
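
The agreement indices reported in this section (correlation, mean absolute difference, mean squared difference, and root mean square deviation) can be computed as in the sketch below. The function name and the four pairs of parameter values are hypothetical and are not the study's estimates; the root mean square deviation is taken to be the square root of the mean squared difference, which is consistent with the values reported above.

```python
import math


def agreement_statistics(x, y):
    """Agreement between two sets of estimates of the same parameters."""
    n = len(x)
    diffs = [u - v for u, v in zip(x, y)]
    mean_abs_diff = sum(abs(d) for d in diffs) / n
    mean_sq_diff = sum(d * d for d in diffs) / n
    rmsd = math.sqrt(mean_sq_diff)
    # Pearson correlation computed from deviations about the means.
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    sxx = sum((u - mean_x) ** 2 for u in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    correlation = sxy / math.sqrt(sxx * syy)
    return {
        "correlation": correlation,
        "mean_abs_diff": mean_abs_diff,
        "mean_sq_diff": mean_sq_diff,
        "rmsd": rmsd,
    }


# Hypothetical discrimination estimates for four items under each model.
a_trt = [0.50, 1.20, 0.90, 1.60]
a_irt = [0.55, 1.10, 0.90, 1.50]
stats = agreement_statistics(a_trt, a_irt)
print(stats)
```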


Table 2. Summary statistics for discrimination and difficulty 
parameters under TRT and IRT. 

                        a                                     b
Model   M      SD     Max    Min    RMSE    M      SD     Max    Min      RMSE
TRT     0.877  0.436  1.950  0.150  0.048   2.204  2.775  9.580  -0.980   0.266
IRT     0.874  0.399  1.800  0.150  0.046   2.095  2.751  9.530  -0.770   0.316

Note: M: mean; SD: standard deviation; Max: maximum; Min: minimum; RMSE: root mean 
square error 


Results in sample 2 were highly comparable with those in sample 1 as 
discrimination parameters across the two models correlated at .990 and 
difficulty parameters correlated at .997. The other indices were highly 
similar to those obtained in Sample 1, so they are not reported here. The 
discrimination parameters estimated across the two independent samples 
correlated at .866 under IRT and .877 under TRT. The difficulty parameters 
estimated in the two independent samples correlated at .803 under IRT and 
.812 under TRT. This shows that while parameter estimates remain 
relatively stable across independent non-overlapping samples the 
discrimination parameter has remained more invariant. IRT and TRT 
performed equally well as far as the stability of parameter estimation 
across different samples is concerned, with TRT performing slightly 
better. 

Person parameters 

In Sample 1 person parameters across the two models correlated at 
.994, which indicates considerable correspondence between the two models. The 
difference between theta parameters estimated by each model for each 
person was computed. The absolute value of these differences ranged from 
0.00 to .380 with a mean of .079 and a standard deviation of .059. The mean 
of squared differences turned out to be .009 with a standard deviation of 
.013. The root mean square deviation was .094. Table 3 presents the 
summary statistics for the theta parameters in the two models along with 
root mean square errors and reliabilities. 


Table 3. Summary statistics for person ability estimates under TRT 
and IRT. 

Model   M      SD      Max.    Min.     RMSE    Rel.
TRT     0.00   0.860   2.543   -2.003   0.508   0.650
IRT     0.00   0.895   2.726   -2.096   0.435   0.763

Note: M: mean; SD: standard deviation; Max: maximum; 
Min: minimum; RMSE: root mean square error; 
Rel.: reliability 


Note that RMSE is the square root of the mean of the theta 
parameters’ squared standard errors. The reported reliabilities are Bayesian 
reliabilities (Wainer et al., 2007), which are computed by dividing the 
variance of the expected a posteriori theta parameters by the same value 
plus the mean of squared person parameter standard errors. The mean of 
the ability distribution is set to zero for model identification. Table 3 shows 
that both models have produced similar amounts of dispersion, with IRT 
yielding a slightly wider distribution. The major difference between the two 
models is the smaller error of estimation and higher reliability of IRT 
estimates compared to the TRT model. 
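
The verbal definitions of RMSE and Bayesian reliability given above can be written out as a short sketch. The function names, the use of the population variance, and the three illustrative examinees are assumptions made here for illustration; the inputs are invented and are not the study's estimates.

```python
import math
from statistics import pvariance


def person_rmse(standard_errors):
    """Square root of the mean of the squared standard errors."""
    mean_sq_se = sum(se * se for se in standard_errors) / len(standard_errors)
    return math.sqrt(mean_sq_se)


def bayesian_reliability(eap_estimates, standard_errors):
    """Variance of the EAP estimates divided by that variance plus the
    mean squared standard error, following the verbal definition above."""
    var_eap = pvariance(eap_estimates)  # assumption: population variance
    mean_sq_se = person_rmse(standard_errors) ** 2
    return var_eap / (var_eap + mean_sq_se)


# Hypothetical EAP ability estimates and standard errors for three examinees.
eap = [-1.0, 0.0, 1.0]
reliability_small_se = bayesian_reliability(eap, [0.4, 0.4, 0.4])
reliability_large_se = bayesian_reliability(eap, [0.5, 0.5, 0.5])
print(reliability_small_se, reliability_large_se)
```

Holding the spread of the ability estimates fixed, larger standard errors lower the reliability, which is the direction of the difference between the IRT and TRT estimates described above.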

In Sample 2 person parameters across the two models correlated at 
.996, which indicates considerable correspondence between the two models. The 
rest of the agreement statistics were highly comparable to those obtained for 
Sample 1. As was observed in Sample 1, the striking difference was in the 
reliability of the estimates. The reliability for TRT person parameters was 
.746 and for IRT was .814. 


DISCUSSION 

The assessment of reading comprehension almost always entails 
testlets. Reading comprehension tests make use of passages which are 
followed by several questions. This may be supplemented by a passage with 
missing words or letters to be filled in. Although using testlets in 
educational measurement is economical, efficient, and valid (Wainer et al., 
2007) they pose problems for data analysis: Testlets violate the conditional 
independence assumption of IRT models and can lead to biased parameter 
estimates and spuriously low standard errors (Baghaei, 2010; Baghaei & 
Aryadoust, 2015; Wainer & Wang, 2000). The purpose of this investigation 
was (1) to examine the extent to which cloze test items and reading 
comprehension items generate local item dependence and (2) to assess the 
impact of local item dependence on item and person parameter estimates 
and their precision. 

Findings of the study indicated that, as expected, cloze test items 
generated substantially higher levels of local dependency than reading 
comprehension items. The LID magnitude in the cloze test was almost four 
times greater than the LID produced in reading passages. This is in line with 
Zhang (2010) who found a testlet effect magnitude of 1.43 for a cloze test 
with 20 items and testlet effect values of .58, .53, .35, and .59 for four 
reading comprehension passages (each with five items) in the Examination 
for the Certificate of Proficiency in English. However, the results are in 
contrast with those of Eckes and Baghaei (2015) and Schroeders, Robitzsch, 
and Schipolowski (2014), who found very small testlet effects for C-Test 
passages. 

This discrepancy can be attributed to the differences in the structures 
of the cloze test and the C-Test. Frequent half-word deletions in C-Tests (in 
contrast to every 5th or 6th full-word deletions in cloze) probably prevent 
application of text-level processes by examinees. In other words, C-Test 
taking is more a local gap-filling task without resort to higher-order text-level 
processes. Another possible reason for the contrasting results between 
cloze and C-Test might be the language-specific characteristics and types of 
the texts used to construct the cloze tests and the C-Tests used in these 
studies. Another reason for the different magnitudes of LID generated by the 
C-Test and the cloze test might be the text length. In cloze tests usually longer 
passages are used but in C-Tests several independent short passages are 
employed. Obviously short passages do not allow for text-level processes to 
be activated by the examinees. 

The reason for higher magnitude of LID in the cloze testlet compared 
to the reading comprehension testlets could be due to the fact that LID in 
cloze has two sources. Marais and Andrich (2008) distinguished between 
two types of LID: trait dependence and response dependence. Trait 
dependence (referred to as multidimensionality in the literature) occurs 
when a secondary trait or dimension is being measured by the test. 
Response dependence or item chaining effect (Wang & Wilson, 2005) 
occurs when the answer to an item affects how subsequent items are answered. 
Higher LID for cloze tests is expected since they are more likely to be 
affected by both trait and response dependence. 

Another reason for the higher magnitude of dependency in the cloze
test could be the method-specific features of the cloze that may require
certain abilities over and above language proficiency. As explained earlier,
TRT is a bifactor model in which all items load on a general ability
dimension and, at the same time, each item loads on a testlet-specific
dimension. One shortcoming of TRT is that the unique variances in testlets
are lumped together and assumed to be variance due to LID only, whereas the
unique variance could be a combination of many other factors such as
testlet-specific knowledge or test method variance (Baghaei & Aryadoust,
2015).

Unless we can argue that the cloze and reading passages are measures 
of the same construct, comparison of testlet variances across these two test 
types is not justified. The higher magnitude of dependency in the cloze test
might simply reflect cloze-specific abilities not shared with reading
comprehension rather than construct-irrelevant variance due to item
dependency. As argued before, the literature on the construct validity
of rational cloze tests has demonstrated that they measure higher-level
reading comprehension ability.

There is no way to disentangle these variances unless we have
multiple cloze testlets as well as multiple reading testlets. This can be
addressed with at least two cloze tests and two reading comprehension
passages. In a bifactor model fitted to such data, all items (cloze and
reading) are forced to load on a reading factor, while the cloze items load
on a cloze factor as well. Furthermore, each testlet loads on a
testlet-specific factor. This way we can separate testlet variance from
cloze-specific variance in cloze passages.
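
As a minimal sketch of this design, the extended bifactor model could be
written as follows. The two-parameter logistic form and all symbols are
illustrative assumptions, not the notation or parameterization of the models
fitted in this study.

```latex
\operatorname{logit} P(X_{ij} = 1)
  = a_i\,\theta_j \;+\; c_i\,z_i\,\kappa_j \;+\; g_i\,\gamma_{j\,d(i)} \;-\; \beta_i ,
\qquad
z_i =
\begin{cases}
1 & \text{if item } i \text{ is a cloze item},\\
0 & \text{if item } i \text{ is a reading item}.
\end{cases}
```

Here $\theta_j$ is the general reading factor for person $j$, $\kappa_j$ is
the cloze factor, $\gamma_{j\,d(i)}$ is the factor specific to the testlet
$d(i)$ that contains item $i$, $a_i$, $c_i$, and $g_i$ are the corresponding
loadings, and $\beta_i$ is the item intercept. The factors are assumed to be
mutually uncorrelated. With a single cloze passage, $\kappa_j$ and the
testlet factor of that passage would load on exactly the same items and could
not be told apart, which is why at least two cloze testlets are needed.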

Despite the large to huge testlet effects observed in this study,
person and item parameter estimates agreed closely across the two models,
as shown by the extremely high correlations between the parameters
estimated by the two models. On the surface, the level of local dependence
did not affect the estimates of the parameters of interest in the IRT model,
where local dependence was ignored. However, the observed differences in
the estimates across the models for some items and persons were as high as
0.21, 0.64, and 0.38 for the discrimination, difficulty, and person ability
parameters, respectively. Such differences could have adverse consequences
when high-stakes decision making is involved. Furthermore, the two models
performed equally well in terms of the stability of parameter estimation
across different populations.

Perhaps the most important ramification of employing an inappropriate
model concerns the reliability of the estimates. It is generally argued that
local dependency leads to biased parameter estimates, including inflated
item discrimination estimates, overestimation of the precision of person
ability estimates, and overestimation of test reliability and test
information (Baghaei, 2010; Zhang, 2010). The results of the present study
showed that, despite the very high magnitudes of testlet effect, item and
person parameters were closely comparable across the two models. The only
substantial difference observed between the two models was in reliability
and the precision of ability estimates. IRT overestimated test reliability
and the precision of person ability parameters. This can lead to serious
problems when computer adaptive testing is used, where the criterion for
test termination is the standard error of person estimates; that is, it
leads to premature test termination (Wainer & Wang, 2000). Ip (2010) showed
that information functions are wrong when LID exists, and the testlet
effect, which is present in language assessments, results in overestimation
of classification accuracy because measurement error is underestimated
(Zhang, 2010).
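
The consequence for adaptive testing can be illustrated with a minimal
Python sketch. It assumes a variable-length test that stops once the
standard error of the ability estimate, computed as 1 / sqrt(test
information), falls to a target value. The item parameters, the target
standard error, and the factor by which item information is shrunk to mimic
a model that accounts for local dependence are all hypothetical values
chosen for illustration; none of them is an estimate from this study.

```python
import math


def item_information(a, b, theta):
    """Fisher information of a two-parameter logistic item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def items_until_stop(infos, se_target):
    """Number of items administered before SE(theta) <= se_target.

    Information is accumulated item by item; returns None if the target
    standard error is never reached.
    """
    total = 0.0
    for n, info in enumerate(infos, start=1):
        total += info
        if 1.0 / math.sqrt(total) <= se_target:
            return n
    return None


# Hypothetical pool: 40 identical items (a = 1.2, b = 0), examinee at theta = 0.
nominal = [item_information(1.2, 0.0, 0.0)] * 40

# Hypothetical shrinkage (0.7) standing in for the information lost to
# local dependence; an assumed value, not a result of the study.
adjusted = [0.7 * info for info in nominal]

SE_TARGET = 0.4  # hypothetical stopping rule
n_nominal = items_until_stop(nominal, SE_TARGET)
n_adjusted = items_until_stop(adjusted, SE_TARGET)
print(n_nominal, n_adjusted)
```

Under these assumed values the nominal information satisfies the stopping
rule after fewer items than the shrunken information does, so a test that
relies on the nominal values would stop while the actual standard error is
still above the target.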

The TRT approach employed in this study models the testlet effect as
random. However, testlet effects can also be treated as fixed (Beretvas &
Walker, 2012). In the fixed-effects approach, the testlet effect is assumed
to be constant over persons; in other words, LID is an item characteristic.
In the random-effects approach, on the other hand, testlet effects are
assumed to vary (randomly) over persons: they differ for people with
different learning experiences, background knowledge, or levels of interest.
According to Wang and Wilson (2005), the random-effects approach is
more appropriate where trait dependence is present, whereas the
fixed-effects approach is more suitable where an item-chaining effect or
response dependence is suspected. With tests such as cloze, where both types
of dependency are suspected, an approach that takes into account both
fixed and random effects would be more appropriate. The two-level testlet
response model (MMMT-2) proposed by Beretvas and Walker (2012)
models both fixed and random testlet effects. The random testlet effect in
MMMT-2 corresponds to a secondary dimension that the testlet might be
measuring. In other words, the random testlet effect is the effect of the
testlet on the difficulty of the items within the testlet, which is due to
the secondary dimension targeted by the testlet. The fixed testlet effect,
by contrast, can be interpreted as the direct contribution of the testlet to
the difficulty of an item on the primary dimension being measured by the
item. In other words, the fixed testlet effect represents the effect of the
testlet on the difficulty of the items, which is due to the primary
dimension intended to be assessed by the testlet.

Limitations and Suggestions for Further Research 

The current study compared the magnitude of local item dependence
generated by cloze and reading comprehension items and its impact on
parameter estimates and their precision in an advanced test of English as a
foreign language. Findings showed that while even substantial magnitudes
of testlet effect do not affect parameter estimates, they do influence
test reliability and information.

The generalizability of the present study is limited in that there were 
relatively few testlets, especially for the cloze test, and too many test takers. 
To better compare testlet effect in cloze and reading comprehension tests, 
further research can include more cloze and reading testlets. Generalization 
of the findings of the present study might be further limited by the difficulty 
of the items studied. It is suggested that future studies compare the testlet
effect in testlets with easier items. Since a prerequisite for comparing LID
across cloze and reading comprehension tests is that they measure the same
latent trait, it is also suggested that future studies develop rational cloze
tests following the framework suggested by Bachman (1985), with gaps of
Types 2 and 3 (i.e., within sentence and within text), and then compare LID
across reading and cloze. As noted before, another strategy is to have
multiple cloze tests and multiple reading testlets and to use a bifactor model
to disentangle the testlet effect from the cloze-specific variance in cloze
tests.